1 Reference proteomes
1.1 SGD
Saccharomyces Genome Database
The reference genome sequences of the widely used Baker's yeast are maintained by tbe SGD, corresponding to the strain S. cerevisiae S288C.
The SGD also gather many key information such as:
- genome assembly
- Protein-coding sequence
- gene/protein nomenclature
- biological activity (interaction/regulation/complexes)
- curated annotations and descriptins (gene ontology)
- gene and protein expression
- accurate scientific literature,
- genotype/phenotype experiments ...
# YEAST REFERENCE SEQUENCES
## SGD
S288C = load.sgd.proteome(withORF=T,rm.stop=F) # Reference SGD protein sequences
S288C.cds = load.sgd.CDS(withORF=T) # Reference coding sequences (CDS)
SGD = load.sgd.features() # main SGD gene/protein features1.2 UniProt
Universal Protein resource
UniProt provides the most comprehensive resource of protein sequence and functional information.
Most protein features are typically stored on the database called UniProt KnowledgeBase (UniProtKB).
This database is a major hub for many specialized databases and webservers that helps understand proteins role.
Their reference proteome gather proteins for which expression or proof of physical existence has been shown.
## UNIPROT
# saveRDS(load.uniprot.features(),"data/uniprot-features.rds") # Takes >250sec for ~6700 ids
UNI.SC = load.uniprot.proteome('yeast') # Reference UniprotKB sequences (Uniref)
UNI = readRDS("data/uniprot-features.rds") %>% # main UNIPROT gene/protein features
group_by(UNIPROTKB) %>%
mutate( is_uniref = UNIPROTKB %in% names(UNI.SC), one2one = (n()==1) ) 1.3 YK11
1,011 Saccharomyces cerevisiae isolates
The 1002 Yeast Genomes project constitutes the most comprehensive genomic dataset on a single species of yeast. Ultimately provide the most extensive view of the genetic and phenotypic diversity within this model species to date.
The project was originally slated to include 1002 strains from diverse global locations, as well as a variety of ecological sources. A total of 1011 whole genome sequences have been produced and published in J. Peter et al., Nature 2018.
## 1011 strains
riboseq_strains = c('AMH','BAN','BED','BPL','BTI','CMP','CPI','CQC') # Strains with riboseq data (on 14/01/21)
strains.info = load.peter2018.data() %>% # strains info from supp mat of Science paper
mutate( has_riboseq = standardized.name %in% riboseq_strains)
## REF: J. Peter et al., 2018, Science
## Genome evolution across 1,011 Saccharomyces cerevisiae isolates
##
-
/
# saveRDS(load.1011.strains(),"data/proteome-1011-strains.rds") # Takes >5mn for 6575 ids
YK11 = readRDS("data/proteome-1011-strains.rds") # 1011 strains proteomes sequencesAll strains read were aligned on the reference yeast genome (ATCC 204508 / S288c). Therefore, all sequenced ORF should have exactly the same length as the references.
1.4 Statistics
Below we calculate some general statistics on the reference yeast proteome.
id_orf=names(S288C)
map_sgd2orf = SGD %>% dplyr::filter(name %in% id_orf) %>% dplyr::select(sgdid,name)
id_sgd = map_sgd2orf$sgdid
UNI.1 = UNI # Backup of Uniprot features table
UNI = left_join(UNI,map_sgd2orf, by=c('SGD'='sgdid')) %>% dplyr::rename(ORF=name)
id_uni = names(UNI.SC)
map_uni2sgd = UNI %>% dplyr::filter(UNIPROTKB %in% id_uni & one2one & !is.na(SGD)) %>% dplyr::select(UNIPROTKB,SGD,ORF)
id_y1k = names(YK11)
n_sgd = length(S288C) %>% as.character()
n_uni = length(UNI.SC) %>% as.character()
n_y1k = length(YK11) %>% as.character()Sequences of both SGD and UNIPROTKB proteomes are quite similar, yet, they don't share the same number of proteins.
SGD contains 6713 sequences while UNIPROTKB has 6049 sequences.
uniseq = UNI.SC[map_uni2sgd$UNIPROTKB]
sgdseq = rm.stop(S288C[map_uni2sgd$ORF])
aliref = align.pair.prot(uniseq, sgdseq) # Takes ~20s for 6042 ids
#saveRDS(pairwise.alignment.to.df(aliref),"data/sgduni-aligned-proteome.rds")
# Takes ~8mn for 6042 ids
ALI_REF = readRDS("data/sgduni-aligned-proteome.rds")
ALI_STAT = group_by(ALI_REF,s1) %>%
summarise( not_aligned=sum(!aligned), OL = mean(ol.12), PID = mean(pid.long) )
## OVERALL PROTEOME ALIGNMENT
# 99.93% (+/- 1.44%) identity on average for 6014 proteins
n_aligned = sum(ALI_STAT$PID==100) %>% as.character()
pid_avg = mean(ALI_STAT$PID) %>% round(2) %>% as.character()
pid_sd = sd(ALI_STAT$PID) %>% round(2) %>% as.character()
## FOR MISALIGNED PROTEINS
n_misaligned = sum(ALI_STAT$PID<100) %>% as.character()
# 85.97% (+/- 16.10%) identity on average for 28 proteins
pid_avg.2 = mean(ALI_STAT$PID[ALI_STAT$not_aligned!=0]) %>% round(2) %>% as.character()
pid_sd.2 = sd(ALI_STAT$PID[ALI_STAT$not_aligned!=0]) %>% round(2) %>% as.character()
n_unaligned = sum(ALI_STAT$not_aligned) %>% as.character()On average, the mapped SGD/UNIPROT sequences share 99.93% identity (+/- 1.44%)
6014 proteins are exactly identical (i.e. 100% identity) between the two proteomes.
28 proteins contains mismatches/indels for at least 2811 residues.
The misaligned proteins share on average 85.97% identity (+/- 16.1%)
2 Proteome datasets
2.1 Annnotations
Gene ontology, SGD description, Uniprot commments
The protein localizations can be obtained from curated annotations from Gene Ontology (GO) and the subcellular locations from Uniprot.
GO = get.uniprot.go(id_uni)
##
##
## 'select()' returned 1:many mapping between keys and columns
## Joining, by = c("UNIPROT", "GO")
CC = GO %>% dplyr::filter(ONTOLOGY == CC)
LOC.m = get.uniprot.localization(annot = UNI,loc_to_columns = T)
LOC.df = get.uniprot.localization(annot = UNI,loc_to_columns = F)
DESC = get.uniprot.sgd(id_uni)
## 'select()' returned 1:many mapping between keys and columns
DESC.UNI = DESC %>%
left_join(UNI, by=c('UNIPROT'='UNIPROTKB','ORF'='ORF','SGD'='SGD'))
CAT.ORF=list()
CAT.ORF$has_intron = SGD %>% filter(type=='intron') %>% pull(parent) %>% intersect(id_orf)
ESS = load.vanleeuwen2020.data()
## REF: Van Leeuwen et al., 2020, Molecular Systems Biology
## Systematic analysis of bypass suppression of essential genes
CAT.ORF$essential = ESS %>%
filter(disp !='Dispensable') %>%
filter(disp == 'Core' | (KO_exp =='Indispensable' & disp_score < 0.4 )) %>%
pull(ORF)
CAT.ORF$dispensable = ESS %>% filter(disp =='Dispensable' ) %>% pull(ORF)2.2 Conservation
Site-specific evolutionary rates
Rate4Site is a program for detecting conserved amino-acid sites by computing the relative evolutionary rate for each site in the multiple sequence alignment (MSA).
For the fungi lineage, we calculated the evolutionary rates of the orthologs defined in Wapinsky et al., Nature 2007.
For the S. cerevisiae population, we calculated the rate of evolution at every site of the 1011 isolates proteomes.This rate is expected to be better than a non-synonymous SNP frequency.
Indeed, the phylogenetic tree of the 1011 strains is taken into account to balance for SNPs that occur in closely related strains relative to SNPs coming from the most recent common ancestor (MRCA).
WAP.r4s = load.wapinsky2007.data() # Evolutionary rate for fungi lineage (~20s to load)
## Get rate4site data for yeast based on wapinsky 2007 fungi lineage: 19.66 sec elapsed
#saveRDS(load.rate4site_1011.data(), "data/rate4site-1011-strains.rds") (~3mn to load)
K11.r4s = readRDS("data/rate4site-1011-strains.rds") # Evolutionary rate for 1011 strains
EVO = load.aligned.data(data.path="data/" ) # Protein dataset with aligned evolutionary rate
##
## Merged dataset with residue-level informations from:
## (1) SGD S288C proteome sequence
## (2) Uniprot reference yeast proteome (aligned to SGD)
## (3) Dubreuil et al. (2019) for Abundance, Disorder (IUP+D2P2) & stickiness (+aaindex)
## (4) Rate4Site based on fungi lineage Wapinsky et al. (2007) of 14 yeast species
## (5) 3Dcomplex for yeast quaternary structures (based on X-Ray from PDB)
FUNGI = WAP.r4s %>%
group_by(r4s_orf,r4s_size) %>%
mutate(R4S_norm = R4S_res + abs(min_(R4S_res))) %>%
summarise(r4s.fungi = mean_(R4S_norm),
max_r4s.fungi = max_(R4S_norm)
)
POP = K11.r4s %>% dplyr::filter( !is.na(ID) ) %>%
group_by(orf,len.s288c) %>%
summarise(r4s.yk11 = mean_(SCORE),
max_r4s.yk11 = max_(SCORE)
)
R4S = left_join(POP,FUNGI, by=c('orf'='r4s_orf', 'len.s288c'='r4s_size')) %>%
mutate( log10.r4s.fungi = log10(r4s.fungi),
log10.r4s.yk11 =log10(r4s.yk11))
CAT.ORF$ohnologs = load.byrne2005.data()$orf
## REF: Byrne and Wolfe, 2005, Genome Research
## The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species
# FIND OUTLIERS
# fit = lm(log10.r4s.fungi ~ log10.r4s.yk11, data = R4S) %>% broom::augment_columns(data=R4S,interval='confidence')
# cutoff.outlier= 20*mean(fit$.cooksd)
# R4S = fit %>% mutate(outlier = .cooksd > cutoff.outlier)2.3 Expression
Protein expression is taken from Ho et al., Cell Systems, 2018 which attempts to unify protein abundance datasets from the Protein Abundance database. The unified protein expression is given in molecules per cell (MPC). The coefficient of variation is calculated as:
\[\displaystyle CV = \frac{\sigma}{\mu}\]
where:
sigma (\(\sigma\)) corresponds to the MPC standard deviation across datasets
mu (\(\mu\)) corresponds is the MPC average across datasets
from all datasets is estimated corrected to account for bias due to each method (MS,GFP,TAP)
tRNA adapation index (tAI) is calculated from the coding sequence (CDS) from SGD as in Dos Reis et al., NAR, 2004.
Below is the crossed-correlations between all measurements of gene/protein expression:
TAI = load.codon.usage(S288C.cds)
## REF: M. Dos Reis et al., 2004, Nucleic Acids Research
## Solving the riddle of codon usage preferences: a test for translational selection
## Loading required package: tAI
PAB = load.ho2018.data()
## REF: B. Ho et al., 2018, Cell Systems
## Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
gene_exp = c('tAI','mean.mpc','median.mpc','GFP.avg','MS.avg')
log10_exp = paste0("log10_",gene_exp[-1])
EXP = left_join(TAI,PAB, by=c('prot'='orf')) %>%
mutate( across(all_of(gene_exp[-1]), .fns = log10, .names = "log10_{col}" ) )
gg_point <- wrap(ggally_points, size = 0.5, color = MAIN.COLOR, alpha=0.3)
gg_cor <- wrap(ggally_cor, title='r', method = "spearman",
color=MAIN.COLOR, bg='black', display_grid = FALSE)
F2 = ggpairs(data = EXP,
columns=c(gene_exp[1],log10_exp),
upper = list(continuous = gg_cor),
lower = list(continuous = gg_point),
diag = list(continuous = wrap("densityDiag", col=MAIN.COLOR, lwd=0.5))
)
F23 Yeast SNP
Using the translated protein sequences from the 1011 isolates proteomes, I calculate the frequency of amino acids at every site from each ORF across strains.
Some sites are ambiguous between strains, meaning that some strains may have an alternative amino-acid at that position.
3.1 SNP from 1011 strains
Since, I am looking at amino-acid frequencies we can only consider non-synonymous mutations.
Indels are not incorporated in the strains sequences.
SNP_FREQ=round(1/1011,4)
wt.file = 'data/YK11_WT.rds'
var.file = 'data/YK11_VAR.rds'
snppos.file = 'data/YK11_SNP_pos.rds'
P = tibble( orf=names(YK11),
n_strains=lengths(YK11),
len = sapply(widths(YK11),unique)
) %>% left_join(get.width(S288C), by=c('orf'='orf'),suffix=c('','.s288c')) %>%
mutate( match_wt = len == len.s288c )
VAR = preload(var.file, loading.call = map_df(YK11, get.variants,verbose=F),doing='Find all variants...')
SNP = get.variants.to.SNP(VAR) %>%
add_count(id,ref_pos,name='nvar') %>%
group_by(id,ref_pos) %>%
mutate( alt_cumfr=sum(alt_fr)) %>%
left_join(P,by=c('id'='orf'))
WT = get.positions(S288C) %>%
group_by(orf) %>%
mutate(bin.pos=dplyr::ntile(wt_pos,100))
PROT_SNP = left_join(SNP,WT, by=c('id'='orf','ref_pos'='wt_pos','len.s288c'='len')) %>%
mutate(wt_low = alt_aa == wt_aa,
wt_missing=is.na(wt_aa),
early_stop = (alt_aa == "*" & ref_pos != len),
dSTI.ref=get.score.mutation(ref_aa,alt_aa),
dSTI.wt=get.score.mutation(wt_aa,alt_aa))
BRK.FREQ = c(0,0.0001,1e-3,1e-2,5e-2,1e-1,5e-1)
LAB.FREQ = paste.pair(trimws(sprintf("%.4f",BRK.FREQ), whitespace = '0', which = 'right'),s='-')
PROT_SNP$fr_bin = cut(PROT_SNP$alt_fr,breaks = BRK.FREQ,labels = LAB.FREQ,ordered_result = T,include.lowest = T)
PROT_SNP$has_ortho = PROT_SNP$id %in% PROT_SNP$r4s_orf # filter for orthologs
# COUNT ALL
singletons = PROT_SNP$nvar==1
earlystop = PROT_SNP$early_stop
TOTAL_SNP_VAR = nrow(PROT_SNP) %>% as.character()
TOTAL_SNP_POS = n_distinct(PROT_SNP$id,PROT_SNP$ref_pos) %>% as.character()
TOTAL_SNP_SINGLE = sum(singletons) %>% as.character()
TOTAL_SNP_MULTI = sum(!singletons) %>% as.character()
TOTAL_SNP_STOP = sum_(earlystop) %>% as.character()SNP is defined as the segregating site where alternative amino acids (i.e variants) are observed across strains.
The "reference" residue is the amino acid with the highest frequency across all strains.
3.2 SNP count
A total of 574400 variants have been identified spread over 513876 positions of the proteome.
Out of those SNPs, there are 458173 singletons, i.e. an alternative amino acid was acquired in a single strain.
The remaining 116227 sites have at least 2 residues that coexist in the population.
In total, there are 6858 SNP that are premature STOP codons.
snp_count = PROT_SNP %>%
group_by(nvar) %>%
summarize(var_count=sum(nvar)) %>%
ungroup() %>% mutate(var_mean = var_count / sum(var_count))
snp_byfreq = PROT_SNP %>%
group_by(nvar,fr_bin) %>%
summarize( var_count_byfreq = n()) %>%
group_by(fr_bin) %>%
mutate(var_total_byfreq= sum(var_count_byfreq)) %>%
ungroup() %>%
mutate( var_mean_byfreq = var_count_byfreq / sum(var_count_byfreq),
total_mean = cumsum(var_mean_byfreq )) %>%
arrange(total_mean)
p0.0 = ggplot(snp_count, aes(x=factor(nvar))) +
geom_bar(aes(y=var_count),stat='identity') +
xlab('# parallel variants') + ylab('Count')
p0.1 = ggplot(snp_count, aes(x=factor(nvar))) +
geom_bar(aes(y=var_mean),stat='identity') +
xlab('# parallel variants') + ylab('variants %') +
scale_y_continuous(labels = scales::percent)
plot(p0.0)plot(p0.1)
p1.0 = ggplot(snp_byfreq, aes(x=factor(nvar),group=fr_bin)) +
geom_bar(aes(y=var_count_byfreq,fill=fr_bin),stat='identity') +
scale_fill_wsj(name='SNP%') + theme(legend.position = 'top', legend.text = element_text(size=9))+
xlab('# parallel variants') + ylab('Count')
p1.1 = p1.0 + facet_wrap(~fr_bin,scales = 'free_y') + theme(legend.pos='non')
plot(p1.0)plot(p1.1)
p2.0 = ggplot(snp_byfreq, aes(x=fr_bin,fill=fr_bin)) +
geom_bar(aes(y=var_mean_byfreq),stat='identity')+
scale_y_continuous(labels = scales::percent) +
scale_fill_wsj(name='SNP%') + scale_color_wsj() + theme(legend.position = 'top', legend.text = element_text(size=9))+
scale_x_discrete('Variant frequency bin') + ylab('% variants')
p2.1 = ggplot(snp_byfreq, aes(x=1,fill=fr_bin)) +
geom_bar(aes(y=var_mean_byfreq),stat='identity', position = 'stack') +
geom_text_repel(data=subset(snp_byfreq,nvar==1),segment.size = 1, direction = "y",
mapping=aes(y=var_mean_byfreq,x=0,color=fr_bin,label=sprintf("%.f%% (%s)\n",100*var_mean_byfreq,var_total_byfreq)),
position='stack',fontface='bold',show.legend = F)+
scale_fill_wsj(name='SNP%') + scale_color_wsj() + theme(legend.position = 'top', legend.text = element_text(size=9))+
scale_y_continuous(labels = scales::percent) +
scale_x_discrete('Variant frequency bin') + ylab('% variants')
plot(p2.0)plot(p2.1)
# Histogram for number of SNP
p3.0 = get_snp_orf_plot(FREQMIN = 0)
p3.1 = get_snp_orf_plot()
plot(p3.0)plot(p3.1)Similarly, I calculated the number of "early-STOP" codons, i.e. STOP codons which occur before the last amino acid.
stop_count = PROT_SNP %>% dplyr::filter(alt_fr>0.1) %>%
group_by(id) %>%
summarize( nstop = sum(early_stop),len=mean(len)) %>%
left_join(DESC,by=c('id'='ORF')) %>%
left_join(EXP,by=c('id'='prot')) %>%
mutate(has_introns = (id %in% CAT.ORF$has_intron))
# Histogram for number of early STOP
F6 = ggplot(stop_count,aes(x=nstop)) +
geom_histogram(binwidth = 1, fill=MAIN.COLOR,color='black') +
xlab("early-STOP count") + ylab('# ORFS') + scale_y_log10()
plot(F6)
# early STOP count vs Protein Length
F7 = ggplot(stop_count %>% filter(nstop >0)) +
geom_point(aes(y=nstop,x=log10_mean.mpc,text=id,func=FUNCTION,color=factor(has_introns)), shape=21,size=0.8) +
xlab("Protein Length (AA)") + ylab('early-STOP count')
ggplotly(F7)3.3 SNP count vs Protein Length
p4.0 = get_snp_protlen_plot(get_snp_protlen(PROT_SNP,0))
## Number of groups: 1
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(XX)` instead of `XX` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(YY)` instead of `YY` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(BY)` instead of `BY` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(ID)` instead of `ID` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## removing NAs...
p4.1 = get_snp_protlen_anim(animated=T,psnp=PROT_SNP, width = 1920/2, height = 1080/2, units = "px")
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
## Number of groups: 1
## removing NAs...
plot(p4.0)p4.1
# SNP count along protein length
p5.0=get_snp_pos_plot(FREQMIN = 0)
p5.1=get_snp_pos_plot(PROT_SNP)
plot(p5.0)plot(p5.1)3.4 SNP vs. evolution
R4S.desc = left_join(R4S,DESC,by=c('orf'='ORF')) %>%
left_join(SGD,by=c('SGD'='sgdid')) %>%
filter(complete.cases(r4s.yk11,r4s.fungi)) %>%
ungroup() %>%
mutate( has_introns=orf %in% CAT.ORF$has_intron,
has_ohnologs=orf %in% CAT.ORF$ohnologs,
sizerange= cut_number(len.s288c,n = 4),
is_dispensable = orf %in% CAT.ORF$dispensable,
is_essential = orf %in% CAT.ORF$essential
)# %>%
#left_join(LOC.df,by=c('UNIPROT'='id')) %>%
#add_count(loc,name = 'nloc') %>% filter(nloc > 100)
# CC=cor.sub.by(R4S.desc, XX='log10.r4s.yk11',YY='log10.r4s.fungi',BY='loc', ID = 'orf', na.rm = T) %>%
# mutate( toshow=sprintf("r=%.3f\np=%.1e\nn=%s",r,p,N) )
C6.0 = spearman.toplot(Y=R4S.desc$log10.r4s.fungi,X=R4S.desc$log10.r4s.yk11)
p6.0 = ggplot(R4S.desc, aes(y=log10.r4s.fungi,x=log10.r4s.yk11,text=orf,func=FUNCTION)) +
geom_point(fill=MAIN.COLOR,alpha=0.4,shape=21,size=0.8) +
#geom_point(data=subset(R4S.desc,has_introns),color='red',fill=NA,shape=1,size=1.4,stroke=0.2)+
#geom_point(data=subset(R4S.desc,has_ohnologs),color='blue',fill=NA,shape=1,size=1.4,stroke=0.2)+
#geom_point(data=subset(R4S.desc,is_essential),color='green',fill=NA,shape=1,size=1.4,stroke=0.2)+
#geom_point(data=subset(R4S.desc,is_dispensable),color='purple',fill=NA,shape=1,size=1.4,stroke=0.2)+
geom_text(data=C6.0, aes(x=Inf,y=-Inf,label=toshow),color=MAIN.COLOR,size=4,inherit.aes = F,hjust='inward',vjust='inward') +
#geom_text(data=subset(R4S.desc,outlier), mapping=aes(label=orf),color='white',size=2) +
xlab("1011 strains evolutionary rate") + ylab('Fungi Evolutionary rate')
# facet_wrap(~loc)
ggplotly(p6.0)
snp_count_orf=get_snp_orf(PROT_SNP,0) %>%
mutate(has_ortho = id %in% CAT.ORF$has_intron,
is_essential= id %in% CAT.ORF$essential,
is_dispensable = id %in% CAT.ORF$dispensable,
has_ohnolog = id %in% CAT.ORF$ohnologs)
SNP.PROT = left_join(snp_count_orf, EXP , by=c('id'='prot')) %>% # Add expression data
left_join(R4S, by=c('id'='orf')) %>% # Add conservation data
left_join(DESC,by=c('id'='ORF')) # Add annotation data
# SNP count vs Evolutionary rate fungi
C6.1 = spearman.toplot(SNP.PROT$sites,SNP.PROT$log10.r4s.fungi)
p6.1 = ggplot(SNP.PROT) +
geom_point(aes(x=sites,y=log10.r4s.fungi,text=id,func=FUNCTION), color=MAIN.COLOR, fill=NA,shape=21,size=0.8) +
geom_text(C6.1, mapping=aes(x=1,y=0.5,label=toshow),hjust='inward',vjust='inward',check_overlap =T,size=4) +
xlab('SNP count') + ylab("Fungi Evolutionary rate") +
scale_x_continuous(trans='log2')
ggplotly(p6.1)
# SNP count vs Evolutionary rate 1011 strains
C6.2 = spearman.toplot(SNP.PROT$sites,SNP.PROT$log10.r4s.yk11)
p6.2= ggplot(SNP.PROT) +
geom_point(aes(x=sites,y=log10.r4s.yk11,text=id,func=FUNCTION), color=MAIN.COLOR, fill=NA,shape=21,size=0.8) +
geom_text(C6.2, mapping=aes(x=1,y=0.5,label=toshow),hjust='inward',vjust='inward',check_overlap =T,size=4) +
xlab('SNP count') + ylab("1011 strains evolutionary rate") +
scale_x_continuous(trans='log2')
ggplotly(p6.2)3.5 SNP vs. expression
# SNP count vs Median protein expression
C7.0 = cor.sub.by(SNP.PROT, "log10_median.mpc", "sites", 'has_ortho') %>%
mutate( toshow = sprintf("r %.3f \n p %.1e \n n %s",r,p,n))
## Number of groups: 0
## removing NAs...
p7.0 = ggplot(SNP.PROT) +
geom_point(aes(y=sites,x=log10_median.mpc,text=id,func=FUNCTION), color=MAIN.COLOR, fill=NA,shape=21,size=0.8) +
geom_text(C7.0, mapping=aes(x=0,y=1,label=toshow),hjust='inward',vjust='inward',check_overlap =T,size=4) +
xlab("Molecules per cell (log10 median)") + ylab('SNP count') +
scale_y_continuous(trans='log2') + facet_wrap(~has_ortho,labeller = label_both)
ggplotly(p7.0 )# Evolutionary rate 1011 strains vs Median protein expression
C7.1 = cor.sub.by(SNP.PROT, "log10_median.mpc", "log10.r4s.yk11", 'has_ortho') %>%
mutate( toshow = sprintf("r %.3f \n p %.1e \n n %s",r,p,n))
## Number of groups: 0
## removing NAs...
p7.1 = ggplot(SNP.PROT) +
geom_point(aes(y=log10.r4s.yk11,x=log10_median.mpc,text=id,func=FUNCTION), color=MAIN.COLOR, fill=NA,shape=21,size=0.8) +
geom_text(C7.1, mapping=aes(x=0,y=-4,label=toshow),hjust='inward',vjust='inward',check_overlap =T,size=4) +
ylab("1011 strains evolutionary rate") + xlab('Molecules per cell (log10 median)') + facet_wrap(~has_ortho,labeller=label_both)
ggplotly(p7.1)
# Evolutionary rate fungi vs Median protein expression
C7.2 = spearman.toplot(SNP.PROT$log10_median.mpc,SNP.PROT$log10.r4s.fungi)
p7.2 = ggplot(SNP.PROT) +
geom_point(aes(y=log10.r4s.fungi,x=log10_median.mpc,text=id,func=FUNCTION), color=MAIN.COLOR, fill=NA,shape=21,size=0.8) +
geom_text(C7.2, mapping=aes(x=6,y=0.5,label=toshow),hjust='inward',vjust='inward',check_overlap =T,size=4) +
ylab("fungi evolutionary rate") + xlab('Molecules per cell (log10 median)')
ggplotly(p7.2)3.6 Variant viewer (Testing... in progress)
References
Jackson et al.: Genome evolution across 1,011 Saccharomyces cerevisiae isolates
Rate4site: Comparison of site-specific rate-inference methods: Bayesian methods are superior
Wapinski et al.: Natural history and evolutionary principles of gene duplication in fungi
Ho et al.: Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome
Dos Reis et al.: Solving the riddle of codon usage preferences: a test for translational selection
Session Info
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_IL.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_IL.UTF-8 LC_COLLATE=en_IL.UTF-8
## [5] LC_MONETARY=en_IL.UTF-8 LC_MESSAGES=en_IL.UTF-8
## [7] LC_PAPER=en_IL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] shiny_1.5.0 broom_0.7.2 gifski_0.8.6
## [4] gganimate_1.0.7 GGally_2.0.0 tAI_0.2
## [7] GO.db_3.10.0 org.Sc.sgd.db_3.10.0 openxlsx_4.1.5
## [10] hablar_0.3.0 Biostrings_2.54.0 forcats_0.5.0
## [13] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [16] readr_1.3.1 tidyr_1.0.3 tibble_3.0.4
## [19] tidyverse_1.3.0 tictoc_1.0 plotly_4.9.3
## [22] ggrepel_0.8.2 ggpubr_0.3.0 ggsci_2.9
## [25] ggthemes_4.2.0 gridExtra_2.3 ggplot2_3.3.2
## [28] AnnotationDbi_1.48.0 hutils_1.6.0 XVector_0.26.0
## [31] IRanges_2.20.2 S4Vectors_0.24.4 Biobase_2.46.0
## [34] BiocGenerics_0.32.0 xfun_0.20
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ggsignif_0.6.0 ellipsis_0.3.0 rio_0.5.16
## [5] fs_1.4.1 rstudioapi_0.11 farver_2.0.3 bit64_0.9-7
## [9] fansi_0.4.1 lubridate_1.7.9 xml2_1.3.2 knitr_1.28
## [13] jsonlite_1.6.1 dbplyr_1.4.3 compiler_3.6.2 httr_1.4.1
## [17] backports_1.1.6 fastmap_1.0.1 assertthat_0.2.1 lazyeval_0.2.2
## [21] cli_2.0.2 tweenr_1.0.1 later_1.0.0 prettyunits_1.1.1
## [25] htmltools_0.5.0 tools_3.6.2 gtable_0.3.0 glue_1.4.2
## [29] fastmatch_1.1-0 Rcpp_1.0.4.6 carData_3.0-3 cellranger_1.1.0
## [33] vctrs_0.3.4 crosstalk_1.0.0 rvest_0.3.5 mime_0.9
## [37] lifecycle_0.2.0 rstatix_0.5.0 zlibbioc_1.32.0 scales_1.1.1
## [41] hms_0.5.3 promises_1.1.0 RColorBrewer_1.1-2 yaml_2.2.1
## [45] curl_4.3 memoise_1.1.0 reshape_0.8.8 stringi_1.4.6
## [49] RSQLite_2.2.0 zip_2.1.1 rlang_0.4.7 pkgconfig_2.0.3
## [53] evaluate_0.14 htmlwidgets_1.5.3 labeling_0.3 bit_1.1-15.1
## [57] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5 bookdown_0.21
## [61] R6_2.4.1 generics_0.0.2 DBI_1.1.0 pillar_1.4.4
## [65] haven_2.2.0 foreign_0.8-72 withr_2.4.1 abind_1.4-5
## [69] modelr_0.1.7 crayon_1.3.4 car_3.0-6 rmarkdown_2.6
## [73] progress_1.2.2 grid_3.6.2 readxl_1.3.1 data.table_1.12.8
## [77] blob_1.2.1 rmdformats_1.0.0 reprex_0.3.0 digest_0.6.25
## [81] xtable_1.8-4 httpuv_1.5.2 munsell_0.5.0 viridisLite_0.3.0